

Chapter 15: Calculus and automatic differentiation

15.1 What is a derivative?

The derivative is a simple tool for understanding a mathematical function locally - meaning at and around a single point. More specifically, the derivative at a point defines the best linear approximation - a line in two dimensions, a hyperplane in higher dimensions - that matches the given function at that point as well as a line / hyperplane can.

Why would someone come up with such an idea? Because most of the mathematical functions we deal with in machine learning, mathematical optimization, and science in general are too high dimensional for us to examine by eye. Because they live in higher dimensions we need tools (e.g., calculus) to help us understand and intuit their behavior.


15.1.1 Derivatives at a point

Let us begin exploring this idea in pictures before jumping into the math. Let's examine a few candidate functions - beginning with the standard sinusoid

\begin{equation} g(w) = \text{sin}(w) \end{equation}

Below we draw this function over a small range of its inputs, and then at each point draw the line defined by the function's derivative there on top.

The final result is an animated slider widget - at each increment of the slider the sinusoidal function is drawn in black, the point we are at in red, and the corresponding line produced using the derivative in green. Sliding from left to right moves the point - and the tangent line its derivative defines - smoothly across the function.

In [5]:
import numpy as np

# what function should we play with?  Defined in the next line.
g = lambda w: np.sin(w)

# create an instance of the visualizer with this function 
taylor_viz = calclib.taylor2d_viz.visualizer(g = g)

# run the visualizer for our chosen input function
taylor_viz.draw_it(first_order = True,num_frames = 2)
Out[5]:



Notice a few things. First, as you adjust the slider notice how the line produced by the derivative at the point is always tangent to the function. This is true more generally as well - for any function the linear approximation given by the derivative is tangent to the function at every point. Second, notice how the slope of the line defined by the derivative hugs the function at every point - it seems to match the general local steepness of the curve everywhere. This is also true in general: the slope of the tangent line given by the derivative always gives the local steepness - or slope - of the function itself. The derivative naturally encodes this information. Third, notice how at each increment of the slider the tangent line defined by the derivative matches the function itself near the point in red. This is also true in general - the derivative at a point always defines a line that matches the underlying function near that point. In short - the derivative at a point is the slope of the tangent line at that point.

Let's examine another candidate function using the same slider widget

\begin{equation} g(w) = \text{sin}(4w) + 0.1w^2 \end{equation}
In [8]:
# what function should we play with?  Defined in the next line.
g = lambda w: np.sin(4*w) + 0.1*w**2

# create an instance of the visualizer with this function 
taylor_viz = calclib.taylor2d_viz.visualizer(g = g)

# run the visualizer for our chosen input function
taylor_viz.draw_it(first_order = True,num_frames = 1)
Out[8]:



Again as you slide from left to right you can see how the line defined by the derivative at each point stays tangent to the curve, hugs the function's shape everywhere, and generally matches the function near the point.

15.1.2 Secant lines

In the image below we show a picture of the sinusoid in the left panel, where we have plugged the input point $w^0 = 0$ into the sinusoid and highlighted the corresponding point $(0, \text{sin}(0))$ in green. In the middle panel we plot another point on the curve - with input $w^1 = -2.6$ the point $(-2.6, \text{sin}(-2.6))$ in blue - and the *secant line* in red formed by connecting $(-2.6, \text{sin}(-2.6))$ and $(0, \text{sin}(0))$. Finally in the right panel we show the tangent line at $w = 0$ in lime green. The gray vertical dashed lines in the middle panel are there for visualization purposes only.

A secant line is just a line formed by taking any two points on a function - like our sinusoid - and connecting them with a straight line. On the other hand, while a tangent line can cross through several points of a function it is explicitly defined using only a single point. So in short - a secant line is defined by two points, a tangent line by just one.

The equation of any secant line is easy to derive - since all we need is the slope and any point on the line to define it - and the slope of a line can be found using any two points on it (like the two points we used to define the secant to begin with).

The slope - the line's 'steepness' or 'rise over run' - is the ratio of the change in output $g(w)$ to the change in input $w$. Using two generic inputs $w^0$ and $w^1$ - above we chose $w^0 = 0$ and $w^1 = -2.6$ - we can write the slope of a secant line generally as

\begin{equation} \text{slope of a secant line} = \frac{g(w^1) - g(w^0)}{w^1 - w^0} \end{equation}

Now, using the point-slope form of a line, we can directly write out the equation of a secant from the slope above and either of the two points used to define it. Using $(w^0, g(w^0))$, the equation of the secant line $h(w)$ is

\begin{equation} h(w) = g(w^0) + \frac{g(w^1) - g(w^0)}{w^1 - w^0}(w - w^0) \end{equation}

If we think about our green point at $w^0 = 0$ as fixed, then the tangent line at this point can be thought of as the line we get when we shift the blue point very close - infinitely close actually - to the green one.

Example 1. Secant line computation

Taking $w^0 = 0$ and $w^1 = -2.6$ the equation of the secant line connecting $(w^0,\text{sin}(w^0))$ and $(w^1,\text{sin}(w^1))$ on the sinusoid is given as

\begin{equation} h(w) = \text{sin}(0) + \frac{\text{sin}(-2.6) - \text{sin}(0)}{-2.6 - 0}(w - 0) \end{equation}

Since $\text{sin}(0) = 0$ and $\text{sin}(-2.6) \approx -0.5155$ we can write this as

\begin{equation} h(w) = \frac{0.5155}{2.6}w \end{equation}
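This computation is easy to check numerically. A quick sketch (assuming NumPy, which the surrounding cells use as np):

```python
import numpy as np

# the two inputs defining the secant line on the sinusoid
w0, w1 = 0.0, -2.6

# slope of the secant: change in g over change in w
slope = (np.sin(w1) - np.sin(w0)) / (w1 - w0)

# the secant line h(w) in point-slope form, anchored at (w0, sin(w0))
h = lambda w: np.sin(w0) + slope * (w - w0)

print(slope)   # approximately 0.5155 / 2.6, or about 0.198
```

By construction the line passes exactly through both defining points, which makes for an easy sanity check: $h(w^1)$ returns $\text{sin}(w^1)$.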

15.1.3 From secant to tangent line

Below we show a slider-based animation widget that illustrates precisely this idea. As you shift the slider from left to right the blue point - along with the red secant line that passes through it and the green point - moves closer and closer to our fixed point. Finally - when the two points lie right on top of each other - the secant line becomes the green tangent line at our fixed point.

In [2]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.sin(w)

# create an instance of the visualizer with this function
st = calclib.secant_to_tangent.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(w_init = 0, num_frames = 200)
Out[2]:



In sliding back and forth, notice how it does not matter whether we start from the left of our fixed point and move right towards it, or start to the right of the fixed point and move left towards it: either way the secant line gradually becomes tangent to the curve at $w^0 = 0$. There is no big 'jump' in the slope of the line if we wiggle the slider ever so slightly to the left or right of the fixed point - the slopes of the nearby secant lines are very similar to that of the tangent.

When we can do this - come at a fixed point from either the left or the right and the secant line becomes tangent smoothly from either direction with no jump in the value of the slope - we say that a function has a derivative at this point, or likewise say that it is differentiable at the point.

Example 2. The hyperbolic tangent, squared

Many functions like our sinusoid, other trigonometric functions, and polynomials are differentiable at every point - or just differentiable for short. In the Jupyter notebook version of this Section you can tinker around with the previous Python cell - pick another fixed point! - and see this for yourself. You can also tinker around with the function - for example in the next cell we show - using the same slider mechanism - that the function

\begin{equation} g(w) = \text{tanh}(w)^2 \end{equation}

has a derivative at the point $w^0 = 1$.

In [3]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.tanh(w)**2

# create an instance of the visualizer with this function
st = calclib.secant_to_tangent.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(w_init = 1, num_frames = 300)
Out[3]:



Example 3. An example of failure: the rectified linear unit

Notice that the requirement that the slope of the secant line smoothly approach the slope of the tangent line from both directions - from both the left and the right - is important to this definition. There are plenty of functions for which this does not occur at every point, like the function

\begin{equation} g(w) = \text{max}(0,w) \end{equation}

at the point $w^0 = 0$. This function is called a rectified linear unit or relu for short. Using the slider widget we can see that the slope of the secant line visibly jumps at this point. Move the slider back and forth around where $w = 0$ and watch the slope of the secant jump distinctly from zero to one. Because the slopes of the secant lines just to the left and right of the fixed point $w^0 = 0$ fail to line up, the function does not have a derivative here. So try as you might, the line will never turn green.

In [4]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.maximum(w,0)

# create an instance of the visualizer with this function
st = calclib.secant_to_tangent.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(w_init = 0, num_frames = 200,mark_tangent = False)
Out[4]:



15.1.4 From secant slope to derivative

With this in mind how can we compute the equation of a tangent line at some point $w^0$ for a given function? More specifically, how can we compute the derivative here - the slope of this tangent line? Well, we know that if we take another point $w^1$ on either side of $w^0$ and connect the two - creating the secant line with equation

\begin{equation} h(w) = g(w^0) + \frac{g(w^1) - g(w^0)}{w^1 - w^0}(w - w^0) \end{equation}

that as we push $w^1$ ever closer to $w^0$ this secant becomes our tangent line when $w^1 \approx w^0$. Now note that $w^1$ appears only in the slope of this equation, hence the slope is the only quantity that changes as $w^1$ gets closer to $w^0$ and the secant line becomes tangent at $w^0$. This is great because now, in our aim to understand the tangent line, we can focus our attention solely on what happens to the slope of the secant - which is precisely the derivative (the slope of the tangent line) that we are after.

Now, remember that the slope of a line measures its steepness, or 'rise over run': the change in its vertical value ($g(w^1) - g(w^0)$) over the change in its horizontal value ($w^1 - w^0$). In other words

\begin{equation} \text{slope of secant line} = \frac{\text{change in $g$}}{\text{change in $w$}} = \frac{g(w^1) - g(w^0)}{w^1 - w^0} \end{equation}

As $w^1$ inches ever closer to $w^0$ - from either the left or the right of $w^0$ - the change in both $g$ and $w$ becomes incredibly small or infinitesimal. And this is how the derivative is conceptually defined: as the slope of a secant line where $w^1$ is so close to $w^0$ that the change in $g$ and $w$ are both infinitesimal. And remember: the value of this slope needs to be the same whether or not $w^1$ lies to the left or right of $w^0$.

15.1.5 Refining the definition of the derivative

Let's quantify this definition more explicitly using mathematical notation, first backing off the 'infinitesimally small' part for a moment - let's just make the difference very small. Denoting by $\epsilon$ some small positive number (e.g., $\epsilon = 0.0001$), the point $w^1 = w^0 + \epsilon$ lies very close to - and to the right of - $w^0$. The slope of the secant line connecting $(w^0,g(w^0))$ to $(w^0 + \epsilon, g(w^0 + \epsilon))$ is then given as

\begin{equation} \frac{g(w^1) - g(w^0)}{w^1 - w^0} = \frac{g(w^0 + \epsilon) - g(w^0)}{w^0 + \epsilon - w^0} = \frac{g(w^0 + \epsilon) - g(w^0)}{\epsilon} \end{equation}

To ensure that this value is indeed close to the derivative we need to check that the slope of this secant line is very similar to the slope of a secant based at $w^0$ and passing through a point slightly to its left. Taking the same value of $\epsilon$ we can use the point $w^0 - \epsilon$, which lies just to the left of $w^0$. Forming the secant connecting the points $(w^0, g(w^0))$ and $(w^0 - \epsilon, g(w^0 - \epsilon))$ we can compute its slope as

\begin{equation} \frac{g(w^1) - g(w^0)}{w^1 - w^0} = \frac{g(w^0 - \epsilon) - g(w^0)}{w^0 - \epsilon - w^0} = - \frac{g(w^0 - \epsilon) - g(w^0)}{\epsilon} \end{equation}

If there is indeed a derivative at $w^0$ then the value of this slope needs to closely match the slope of our first secant, or in other words

\begin{equation} \frac{g(w^0 + \epsilon) - g(w^0)}{\epsilon} \approx - \frac{g(w^0 - \epsilon) - g(w^0)}{\epsilon} \end{equation}

And - moreover - as we make $\epsilon$ smaller and smaller these two quantities should both settle down to one value, and be perfectly equal to each other.

Notice that we can express this more compactly if we let $\epsilon$ represent a small (in magnitude) positive or negative number. Then we can say equivalently that we desire that the quantity

\begin{equation} \frac{g(w^0 + \epsilon) - g(w^0)}{\epsilon} \end{equation}

to settle down as we make $\epsilon$ smaller and smaller in magnitude. We can still think of this more compact formula as representing the slopes of secant lines on either side of $w^0$, which draw ever closer to $w^0$ on both sides as we make the magnitude of $\epsilon$ infinitesimally small.

Writing this algebraically we say that we want the value $ \frac{g(w^0 + \epsilon) - g(w^0)}{\epsilon} $ to converge to a single value as $\vert\epsilon\vert \longrightarrow 0$.
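We can watch this settling-down happen numerically. A small sketch (assuming NumPy) that evaluates the ratio at $w^0 = 0$ for $g(w) = \text{sin}(w)$, approaching from both the right and the left:

```python
import numpy as np

g  = lambda w: np.sin(w)
w0 = 0.0

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    right = (g(w0 + eps) - g(w0)) / eps     # secant slope from the right
    left  = (g(w0 - eps) - g(w0)) / -eps    # secant slope from the left
    print(eps, right, left)
```

As $\epsilon$ shrinks both columns converge to the same value - which, as we will see, is $\text{cos}(0) = 1$.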

Common notations for the derivative

One common notation used to denote this ratio of infinitesimal changes $\frac{\text{infinitesimal change in $g$}}{\text{infinitesimal change in $w$}}$ is $\frac{\mathrm{d}g}{\mathrm{d}w}$. Here the symbol $\mathrm{d}$ means 'infinitely small change in the value of'. A common variation on this notation puts the $g$ out front, like this $ \frac{\mathrm{d}}{\mathrm{d}w}g$. In short - we have both the definition and symbol to denote a general derivative of $g$ at any point as

\begin{equation} \text{derivative} = \frac{\text{infinitesimal change in $g$}}{\text{infinitesimal change in $w$}}:= \frac{\mathrm{d}g}{\mathrm{d}w} \,\,\, \text{or} \,\,\, \frac{\mathrm{d}}{\mathrm{d}w}g \end{equation}

There are other notations commonly used in practice to denote the derivative, but we will stick to using these.

To denote the derivative at a specific point $w^0$ we will write

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \end{equation}

With this notation the equation of the tangent line to $g$ at the point $w^0$ is written as

\begin{equation} h(w) = g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0). \end{equation}

Example 4. Computing approximate derivatives at a point

Take our sinusoid, the point $w^0 = 0$, and a small magnitude value for $\epsilon$ like $\epsilon = 0.0001$. Computing the slope of a secant line where $w^1 = w^0 + \epsilon$ lies just to the right of $w^0$ we have

\begin{equation} \frac{g(w^0 + \epsilon) - g(w^0)}{\epsilon} = \frac{\text{sin}(0.0001)}{0.0001}\approx 0.99999 \end{equation}

Likewise computing the slope of the secant line where $w^1 = w^0 - \epsilon$ lies just to the left of $w^0$ we have

\begin{equation} -\frac{g(w^0 - \epsilon) - g(w^0)}{\epsilon} = -\frac{\text{sin}(-0.0001)}{0.0001}\approx 0.99999 \end{equation}

Indeed both slopes are approximately equal, so we can definitively say at $w^0 = 0$

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \approx 0.99999 \end{equation}

Using this we can write out the equation of the tangent line to the sinusoid at $w^0 = 0$ as

\begin{equation} h(w) = \text{sin}(0) + 0.99999(w - 0) = 0.99999w \end{equation}

Example 5. Checking non-differentiability at $w = 0$ for the relu function

Checking differentiability of the relu function

\begin{equation} g(w) = \text{max}(0,w) \end{equation}

at $w^0 = 0$ we have that the slope of a secant where $w^1 = w^0 + \epsilon$ for any small $\epsilon > 0$ (e.g., $\epsilon = 0.0001$) coming from the right is

\begin{equation} \frac{g(w^0 + \epsilon) - g(w^0)}{\epsilon} = \frac{\text{max}(0,0.0001)}{0.0001}= \frac{0.0001}{0.0001} = 1 \end{equation}

A similar computation where $w^1 = w^0 - \epsilon$ comes in from the left gives

\begin{equation} -\frac{g(w^0 - \epsilon) - g(w^0)}{\epsilon} = -\frac{\text{max}(0,-0.0001)}{0.0001}= -\frac{0}{0.0001} = 0 \end{equation}

Since these two secant slopes do not match up, the function is not differentiable at $w^0 = 0$, and these computations hold regardless of the magnitude of $\epsilon$.
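These two one-sided computations are easy to reproduce in code. A short sketch (assuming NumPy):

```python
import numpy as np

g   = lambda w: np.maximum(0, w)   # the relu function
w0  = 0.0
eps = 1e-4

# secant slope coming in from the right: exactly 1
right = (g(w0 + eps) - g(w0)) / eps

# secant slope coming in from the left: exactly 0
left = -(g(w0 - eps) - g(w0)) / eps

print(right, left)
```

No matter how small we make $\epsilon$, the two slopes remain $1$ and $0$, so the secants from either side never agree at $w^0 = 0$.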

15.1.6 Numerical differentiation

If we want to make a computer program that estimates the derivative of some function at a point, we could simply implement the approximate definition of the derivative given previously, setting $\epsilon$ to some small positive value. The problem - however - is that setting the value of $\epsilon$ properly is no easy chore.

In the next Python cell we provide a Python class that simply implements the above numerical definition of the derivative for a user-defined choice of $\epsilon$. Those wanting a good introduction to Python classes, in particular for implementing mathematical functions and objects, can see e.g., this excellent book.

In [5]:
class numerical_derivative:
    '''
    A class for computing the numerical derivative
    of an arbitrary input function with a user-chosen epsilon
    '''
    def __init__(self, g):
        # load in function to differentiate, set default epsilon
        self.g = g; self.epsilon = 10**-5

    def __call__(self, w, **kwargs):
        # make local copies 
        g, epsilon = self.g, self.epsilon 
        
        # set epsilon to desired value or use default
        if 'epsilon' in kwargs:
            epsilon = kwargs['epsilon']
        
        # compute derivative approximation and return
        approx = (g(w + epsilon) - g(w))/epsilon
        return approx

While the functionality is very simple, we will see in the examples here that it can be difficult in practice to set the value of $\epsilon$ correctly.

Example 6. A simple sinusoid

Let's check that this class will indeed compute accurate derivatives for a simple function whose derivative we can verify visually

\begin{equation} g(w) = \text{sin}(w) \end{equation}

This elementary function actually has an algebraic formula for its derivative - as we will see in the next Section - which is given by $\frac{\mathrm{d}}{\mathrm{d}w}g(w) = \text{cos}(w)$.

In the next Python cell we run a fine grid of points on the interval $[-3,3]$ through our numerical differentiator and plot the result - along with the original function.

In [6]:
# imports for this cell
import numpy as np
import matplotlib.pyplot as plt

# make function, create derivative
g = lambda w: np.sin(w)
der = numerical_derivative(g)

# evaluate the derivative over this range of input
wvals = np.linspace(-3,3,100)
gvals = [g(w) for w in wvals]
dervals = [der(w,epsilon = 10**-2) for w in wvals]

# plot function and derivative
plt.plot(wvals,gvals,color = 'k',label = 'original function')
plt.plot(wvals,dervals,color = 'r',label = 'numerical derivative') 
plt.legend(bbox_to_anchor=(1.05, 1), loc=2,fontsize = 12); plt.xlabel('$w$')
plt.show()

Looks good! Here we used the particular value $\epsilon = 10^{-2}$. Let's see what happens to our numerical derivative as we adjust the value of $\epsilon$.

In the next Python cell we use a widget to animate a wide selection of values for $\epsilon$ from $1$ to $10^{-17}$. In each slide the function is plotted in black, the true derivative in dashed blue, and the numerical derivative with chosen value of $\epsilon$ (printed on the title of each slide) is shown in red. As you move the slider from left to right the value of $\epsilon$ becomes exponentially smaller.

In [7]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.sin(w)

# create an instance of the visualizer with this function
st = calclib.numder_silder.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it()
Out[7]:



By playing around with the slider you can see that by around $\epsilon = 10^{-3}$ the numerical approximation of the derivative is virtually perfect. So a value for $\epsilon$ at least this small would work perfectly fine for this example.

Notice, however, that not every value smaller than $10^{-3}$ is a good option. When $\epsilon = 10^{-16}$ and smaller things start to look bad: the numerical derivative becomes terrible, and by $10^{-17}$ it is zero in many places.

Therein lies the rub: we need to set $\epsilon$ small for the approximation to be theoretically close to the actual derivative value, but setting $\epsilon$ too small creates a second problem called round-off error. Numerical values - whether or not they are produced from a mathematical function - can only be represented up to a certain accuracy on a computer. In particular, we always have a tough time representing fractional numbers $\frac{a}{b}$ where both $a$ and $b$ are close to zero. But - as we make $\epsilon$ small - this is precisely what becomes of the approximation

$$ \frac{ g(w + \epsilon) - g(w)}{\epsilon} $$

since both the top (since the values $g(w + \epsilon)$ and $g(w)$ become essentially identical) and bottom of this fraction become incredibly small as we shrink the value of $\epsilon$.
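This collapse is easy to reproduce directly with the difference quotient. Once $\epsilon$ drops far enough below machine precision for double-precision floats, $w + \epsilon$ rounds back to $w$ itself, so the numerator - and hence the whole approximation - becomes exactly zero. A minimal sketch (assuming NumPy):

```python
import numpy as np

g, w = np.sin, 1.0    # true derivative at w = 1 is cos(1), about 0.5403

for eps in [1e-2, 1e-8, 1e-16, 1e-18]:
    approx = (g(w + eps) - g(w)) / eps
    print(eps, approx)

# for eps = 1e-18, w + eps rounds back to w in floating point,
# so the numerator - and the whole approximation - is exactly 0.0
```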


Example 7. A function with a rapidly changing derivative

Take for example the function

\begin{equation} g(w) = \frac{\text{cos}(40w)^{100}}{w^2 + 1} \end{equation}

printed out by the next Python cell. In this cell we use a widget that animates a wide selection of values for $\epsilon$ from $1$ to $10^{-17}$. In each slide the function is plotted in black, the true derivative in dashed blue, and the numerical derivative with chosen value of $\epsilon$ (printed on the title of each slide) is shown in red.

In [8]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.cos(40*w)**100/(w**2 + 1)

# create an instance of the visualizer with this function
st = calclib.numder_silder.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it()
Out[8]:



As compared to the first example, here we need to set $\epsilon$ considerably smaller - to around $10^{-6}$ - in order for the numerical derivative to approximate the true derivative well. Once again pushing the slider all the way to the right - making $\epsilon$ very small - also results in a poor approximation due to round-off error.

15.1.7 From tangent line to tangent hyperplane

Instead of the derivative representing the slope of a tangent line in the case of a single-input function, the derivative of a multi-input function represents the set of slopes that define a tangent hyperplane.

Example 8. Tangent hyperplane

This is illustrated in the next Python cell using the following two closely related functions

\begin{array}{c} g(w) = 2 + \text{sin}(w)\\ g(w_1,w_2) = 2 + \text{sin}(w_1 + w_2) \end{array}

In particular we draw each function over a small portion of its input around the origin, with the single-input function on the left and multi-input function on the right. We also draw the tangent line / hyperplane - generated by the derivative there - on top of each function at the origin.

In [2]:
# plot a single-input and a multi-input sinusoid in two and three dimensions
func1 = lambda w: 2 + np.sin(w) 
func2 = lambda w: 2 + np.sin(w[0] + w[1]) 

# use custom plotter to show both functions
callib.derivative_ascent_visualizer.compare_2d3d(func1 = func1,func2 = func2)

Here we can see that the derivative for the multi-input function on the right naturally describes not just a line, but a tangent hyperplane. This is true in general. How do we define the derivative of a multi-input function / the tangent hyperplane it generates?

15.1.8 Derivatives: from secants to tangents

Above we saw how the derivative of a single input function $g(w)$ at a point $w^0$ was approximately the slope

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \approx \frac{g(w^0 + \epsilon) - g(w^0)}{\epsilon} \end{equation}

of the secant line passing through the point $(w^0,\,\,g(w^0))$ and a neighboring point $(w^0 + \epsilon, \,\, g(w^0 + \epsilon))$, and letting $|\epsilon|$ shrink to zero this approximation becomes an equality, and the derivative is precisely the slope of the tangent line at $w^0$. With $N$ inputs we have precisely the same situation - only we can compute a derivative along each input axis, and all such derivatives at every point of the input space.

For example, if we fix a point $(w_1,w_2) = (w^0_1,w^0_2)$ then we can examine the derivative along the first input axis $w_1$ using the same one-dimensional secant slope formula

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w_1}g(w^0_1,w^0_2) \approx \frac{g(w^0_1 + \epsilon,w^0_2) - g(w^0_1,w^0_2)}{\epsilon} \end{equation}

and again as $|\epsilon|$ shrinks to zero this approximation becomes an equality. Since we are in two dimensions the secant line with this slope is actually a hyperplane passing through the points $(w^0_1,w^0_2,g(w^0_1,w^0_2))$ and $(w^0_1 + \epsilon,w^0_2,g(w^0_1 + \epsilon,w^0_2))$. Likewise to compute the derivative in the second input axis $w_2$ here we compute the slope value

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w_2}g(w^0_1,w^0_2) \approx \frac{g(w^0_1 ,w^0_2 + \epsilon) - g(w^0_1,w^0_2)}{\epsilon} \end{equation}

Because each of the derivatives $\frac{\mathrm{d}}{\mathrm{d}w_1}g(w^0_1,w^0_2)$ and $\frac{\mathrm{d}}{\mathrm{d}w_2}g(w^0_1,w^0_2)$ is taken with respect to a single input, they are referred to as partial derivatives of the function $g(w_1,w_2)$.

More commonly one uses a different notation to distinguish them from single-input derivatives - replacing the $\mathrm{d}$ symbol with $\partial$. With this notation the derivatives above are written equivalently as $\frac{\partial}{\partial w_1}g(w^0_1,w^0_2)$ and $\frac{\partial}{\partial w_2}g(w^0_1,w^0_2)$. Regardless of the notation, partial derivatives are computed - as we will discuss in the next Section - in virtually the same manner as single-input derivatives (i.e., via repeated use of the derivative rules for elementary functions and operations).

This nomenclature and notation is used more generally as well to refer to any derivative of a multi-input function with respect to a single input dimension.
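These one-axis secant slopes are simple to compute in code. A sketch (assuming NumPy), using the function $g(w_1,w_2) = 2 + \text{sin}(w_1 + w_2)$ from earlier at the point $(w^0_1,w^0_2) = (0,0)$:

```python
import numpy as np

g   = lambda w1, w2: 2 + np.sin(w1 + w2)
w0  = (0.0, 0.0)
eps = 1e-6

# partial derivative along the first input axis
dw1 = (g(w0[0] + eps, w0[1]) - g(w0[0], w0[1])) / eps

# partial derivative along the second input axis
dw2 = (g(w0[0], w0[1] + eps) - g(w0[0], w0[1])) / eps

print(dw1, dw2)   # both approach cos(0) = 1 as eps shrinks
```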

Example 9. Multi-input secant experiment

In the next Python cell we repeat the secant experiment - shown previously for a single-input function - for the following multi-input function

\begin{equation} g(w_1,w_2) = 5 + (w_1 + 0.5)^2 + (w_2 + 0.5)^2 \end{equation}

We fix the point $(w^0_1,w^0_2) = (0,0)$ and take a point along each axis whose proximity to the origin can be controlled via the slider mechanism. At each instance and in each input dimension we form a secant line (which is a hyperplane in three dimensions) connecting the evaluation of this point to that of the origin. The secant hyperplanes whose slopes are given by the partial derivative approximations of $\frac{\partial}{\partial w_1}g(w^0_1,w^0_2)$ and $\frac{\partial}{\partial w_2}g(w^0_1,w^0_2)$ are then illustrated in the left and right panels of the output, respectively.

When the neighborhood point is close enough to the origin the secant becomes tangent in each input dimension, and the corresponding hyperplane changes color from red to green.

In [3]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
func = lambda w: 5 + (w[0] +0.5)**2 + (w[1]+0.5)**2 
view = [20,150]

# run the visualizer for our chosen input function and initial point
callib.secant_to_tangent_3d.animate_it(func = func,num_frames=50,view = view)
Out[3]:




The hyperplanes at a point $(w^0_1,w^0_2)$ that are tangent along a single input dimension - like those shown in the figure above - have a slope defined by the corresponding partial derivative. Each such hyperplane is rather simple, in the sense that it has non-trivial slope along only a single input axis (we discussed this more generally in the previous Section), and has a single-input form of equation. For example, the tangent hyperplane along the $w_1$ axis has the equation

\begin{equation} h(w_1,w_2) = g(w^0_1,w^0_2) + \frac{\partial}{\partial w_1}g(w^0_1,w^0_2)(w^{\,}_1 - w^0_1) \end{equation}

and likewise the tangent hyperplane along the $w_2$ axis has the equation

\begin{equation} h(w_1,w_2) = g(w^0_1,w^0_2) + \frac{\partial}{\partial w_2}g(w^0_1,w^0_2)(w^{\,}_2 - w^0_2) \end{equation}

However neither simple hyperplane represents the full tangency at the point $(w^0_1,w^0_2)$, which must be a function of both inputs $w_1$ and $w_2$. To get this we must sum up the slope contributions from both input axes, which gives the full tangent hyperplane

\begin{equation} h(w_1,w_2) = g(w^0_1,w^0_2) + \frac{\partial}{\partial w_1}g(w^0_1,w^0_2)(w^{\,}_1 - w^0_1) + \frac{\partial }{\partial w_2}g(w^0_1,w^0_2)(w^{\,}_2 - w^0_2) \end{equation}

As was the case with the tangent line of a single-input function, this is also the first order Taylor Series approximation to $g$ at the point $(w^0_1,w^0_2)$.
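A quick numerical check of this formula (assuming NumPy), using the quadratic $g(w_1,w_2) = 5 + (w_1 + 0.5)^2 + (w_2 + 0.5)^2$ from the animation above at the point $(0,0)$, with each partial derivative approximated by its secant slope:

```python
import numpy as np

g   = lambda w1, w2: 5 + (w1 + 0.5)**2 + (w2 + 0.5)**2
w0  = (0.0, 0.0)
eps = 1e-6

# approximate the two partial derivatives at the fixed point
d1 = (g(w0[0] + eps, w0[1]) - g(*w0)) / eps
d2 = (g(w0[0], w0[1] + eps) - g(*w0)) / eps

# full tangent hyperplane: sum of slope contributions from both axes
h = lambda w1, w2: g(*w0) + d1*(w1 - w0[0]) + d2*(w2 - w0[1])

# near the fixed point the hyperplane closely matches the function
print(g(0.01, 0.01), h(0.01, 0.01))
```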

Example 10. Arbitrary tangent hyperplane

In the next Python cell we illustrate each single-input tangency (with respect to $w_1$ and $w_2$ in the left and middle panels, respectively), along with the full first order Taylor series approximation - the full tangent hyperplane - in the right panel, for the example function shown in the previous animation.

In [4]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
func = lambda w: 5 + (w[0] +0.5)**2 + (w[1]+0.5)**2
view = [10,150]

# run the visualizer for our chosen input function and initial point
callib.secant_to_tangent_3d.draw_it(func = func,num_frames=50,view = view)

So, in short, a multi-input function with $N=2$ inputs has $N=2$ partial derivatives, one for each input. Taken together at a single point, these partial derivatives - like the sole derivative of a single-input function - define the slopes of the tangent hyperplane at this point (also called the first order Taylor Series approximation).

15.1.9 The gradient

For notational convenience these partial derivatives are typically collected into a vector-valued function called the gradient, denoted $\nabla g(w_1,w_2)$, where the partial derivatives are stacked column-wise as

\begin{equation} \nabla g(w_1,w_2) = \begin{bmatrix} \ \frac{\partial}{\partial w_1}g(w_1,w_2) \\ \frac{\partial}{\partial w_2}g(w_1,w_2) \end{bmatrix} \end{equation}

Note that because this is a stack of two partial derivatives the gradient in this case takes in two inputs and returns two outputs. When a function has only a single input the gradient reduces to a single derivative, which is why the derivative of a function (regardless of its number of inputs) is typically just referred to as its gradient.

When a function takes in general $N$ number of inputs the form of the gradient, as well as the tangent hyperplane, mirror precisely what we have seen above. A function taking in $N$ inputs $g(w_1,w_2,...,w_N)$ has a gradient consisting of its $N$ partial derivatives stacked into a column vector

\begin{equation} \nabla g(w_1,w_2,..,w_N) = \begin{bmatrix} \ \frac{\partial}{\partial w_1}g(w_1,w_2,..,w_N) \\ \frac{\partial}{\partial w_2}g(w_1,w_2,..,w_N) \\ \vdots \\ \frac{\partial}{\partial w_N}g(w_1,w_2,..,w_N) \end{bmatrix} \end{equation}

To see why this is a convenient way to express the partial derivatives of $g$, note that using vector notation for the input - e.g., $\mathbf{w} = (w_1,w_2)$ and $\mathbf{w}^0 = (w^0_1,w^0_2)$ - the equation of the tangent hyperplane can be written more compactly as

\begin{equation} h(\mathbf{w}) = g(\mathbf{w}^0) + \nabla g(\mathbf{w}^0)^T(\mathbf{w} - \mathbf{w}^0) \end{equation}

which is the direct generalization in higher dimensions of the formula for a tangent line defined by the derivative given above.
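The compact vector form is just as convenient in code. A sketch (assuming NumPy; numerical_gradient is a hypothetical helper built on the secant slope formula, not a function from the text):

```python
import numpy as np

def numerical_gradient(g, w, eps=1e-6):
    '''stack the N approximate partial derivatives of g into a vector'''
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for n in range(w.size):
        shift = np.zeros_like(w)
        shift[n] = eps                           # perturb one axis at a time
        grad[n] = (g(w + shift) - g(w)) / eps    # secant slope along axis n
    return grad

# tangent hyperplane h(w) = g(w0) + grad(w0)^T (w - w0)
g    = lambda w: 5 + (w[0] + 0.5)**2 + (w[1] + 0.5)**2
w0   = np.array([0.0, 0.0])
grad = numerical_gradient(g, w0)
h    = lambda w: g(w0) + grad @ (w - w0)

print(grad)   # approximately [1., 1.]
```

The same helper works unchanged for any number of inputs $N$, which is precisely the convenience the gradient notation buys us.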

© This material is not to be distributed, copied, or reused without written permission from the authors.